The purpose of this pipeline is to provide a set of additional analyses to complement the analysis done by Metabolon. These additional analysis are broken down into four main sections.
Data Exploration
Sub-pathway Analysis
Pairwise comparisons
Plots
img = image_read("../Workflow.png")
image_trim(img)
The R “chunks” separate each of the analysis steps within the main sections of the pipeline. In some chunks, there may need to be lines of code that need to change from experiment to experiment. This pipeline provides a list of steps for each chunk and highlights places where you need to make changes.
To get started, in the Metabolomics pipeline working directory, there are five folders with the following folder structure:
You should save the .xlsx file from Metabolon and any additional metadata files to the Data/Metabolon folder. The main output folder will contain the plots and tables generated from this pipeline. Additionally, you can run this pipeline as a report and save it as a .pdf or .html. The report saves to the “Outputs/Reports” folder.
In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are important steps to improve the signal-to-noise ratio. In the next section, we are implementing the same standardization and normalization methods used by Metabolon. First we will load the raw peak data into R. To start we:
Provide a path with the file name to the Data tables .xlsx file provided by Metabolon. You may need to change this file path to reflect the name of the data tables .xlsx file for your project.
Read in the “Peak Area Data” tab from the .xlsx file provided above.
# Provide a path to Metabolon .xlsx file.
metabolon_path <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
# 2. Load raw peak data
peak_data <- read.xlsx(metabolon_path, sheet = "Peak Area Data")
The first step in the Metabolon normalization process is median standardization. This step will ensure that each metabolite in the dataset has the same median value. To show this we will show the box plots for 5 metabolites before and after median normalization. Below is the box plots before median standardization where we:
Select the first five metabolites for the box plot.
Create the boxplot data
Plot the box plots for each of the five metabolites.
# 1. Select the first five metabolites for the box plot.
metabolites <- names(peak_data)[2:6]
# 2. Create the boxplot data
plot_data <- peak_data %>%
reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
filter(variable %in% metabolites)
# 3. Plot the box plots for each of the five metabolites.
ggplot(plot_data,aes(x=variable,y=value)) +
geom_boxplot() +
theme_bw()
Now we perform median standardization by:
Initializing a new peak_data_norm matrix
Create a data set containing the median value for each metabolite
Divide each value by the median metabolite value.
# 1. Initialize a new peak_data_med matrix
peak_data_norm <- peak_data
# 2. Create a matrix containing the median value for each metabolite
peak_data_med <- peak_data_norm %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(median, na.rm = T)
# 3. Divide each value for each metabolite by the median value of that metabolite
for(i in colnames(peak_data_med)){
peak_data_norm[,i] <- peak_data_norm[,i]/peak_data_med[,i]
}
Now we can look at those same metabolites from before. We will notice that the median value for all metabolites are now lined up at 1.
# 1. Select the first five metabolites for the box plot.
metabolites <- names(peak_data)[2:6]
# 2. Create the boxplot data
plot_data <- peak_data_norm %>%
reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
filter(variable %in% metabolites)
# 3. Plot the box plots for each of the five metabolites.
ggplot(plot_data,aes(x=variable,y=value)) +
geom_boxplot() +
theme_bw()
Once we standardize each metabolite by the median value, we impute the missing values for each metabolite with the minimum value for that metabolite. We do this in the following steps.
Initialize the new peak_data_imputed matrix
Find the minimum value for each metabolite.
Create a for loop such that if a metabolite has a missing observation, then the minimum value for that metabolite is imputed.
# 1. Initialize the new peak_data_imputed matrix
peak_data_imputed <- peak_data_norm
# 2. Find the minimum value for each metabolite and compute 1/5 of that value
peak_data_mins <- peak_data_imputed %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(min, na.rm = T)
# 3. Impute the value
for(i in colnames(peak_data_mins)){
if(length(peak_data_imputed[,i][is.na(peak_data_imputed[,i])]) > 0){
peak_data_imputed[which(is.na(peak_data_imputed[,i])),i] <- as.numeric(peak_data_mins[i])
}
}
The final step is to take a natural log transformation of each of the values.
# 1. Log transform all of the values
peak_data_log <- peak_data_imputed %>%
mutate_if(is.numeric, log)
Now that we have imputed and normalized our data, we need to save the data for subsequent sections. We will save peak_data_log in the “Data/Processed” Folder under the file name “Log_transformed_data.csv”.
# 1. Save log transformed data.
write.csv(peak_data_log, file = "../Data/Processed/Log_transformed_data.csv",
row.names = F)
For the downstream analysis we will need merge the sample metadata and the log transformed peak data to create one analysis dataset. In the normalization and standardization section we took the raw peak data and performed median value standardization, minimum value imputation and natural log transformation. Metabolon does these preprocessing steps for their customers and provides the log transformed data for their customers in a .xlsx. We recommend utilizing the preprocessed log transformed provided by Metabolon. By utilizing the Log Transformed data derived by Metabolon, we can ensure the down stream supplemental analysis are utilizing the same data as Metabolon.
We need to merge the sample metadata and the Log normalized data together. Additionally, we want to replace the column names with the names of the metabolites. To do this we are going to read in the sample metadata and the log normalized data. We will do this by:
Provide a path to the metabolon .xlsx file.
Read in the sample metadata under the “Sample Meta Data” tab.
Read in the Log transformed data from the “Log Transformed Data” tab.
# 1. Metabolon excel file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
# 2. Read in sample metadata
meta_data <- read.xlsx(met_excel,sheet = "Sample Meta Data")
# 3. Read Log transformed data
peak_data_log <- read.xlsx(met_excel,sheet = "Log Transformed Data")
If there are additional variables that need to be included into the analysis, you can place these variables in a .xlsx file. You must include a variable called “PARENT_SAMPLE_NAME” that links the metadata to each of the samples.
In the following chunk we merge the additional metadata variables to the current metadata. We do this in the following steps.
Provide a path to the additional metadata variables. If you do not have any additional metadata variables, you can set the “additional_meta=NULL”.
If there are additional meta data variables, then merge the additional metadata variables with the metadata from Metabolon.
# 1. Provide path to additional metadata
additional_meta <- "../Data/Metabolon/AdditionalVars.xlsx"
# 2. Merge additional vars to the meta data
if(!is.null(additional_meta)){
meta_data <- read.xlsx(additional_meta) %>%
left_join(meta_data,"PARENT_SAMPLE_NAME")
}
To create the analysis data, we want the metadata and the log transformed peak data merged so each row of the analysis data set is a different sample and the columns contain the sample metadata and the log transformed peak data. To do this we:
Select the metadata variables need for down stream analysis. The metadata provided by Metabolon includes variables which may not be necessary for the analysis. In this section will select the variables in the metadata that are necessary for downstream analysis. To do this we will need to identify the variables we want to keep. This will include the PARENT_SAMPLE_NAME and the experiment group variables. Additionally, if your experiment includes males and females, we recommend including a Sex/Gender variable and stratify the analysis.
Merge the metadata and the log transformed data together while keeping only the metadata variables needed for the analysis.
# 1. Select metadata variables
metadata_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Create analysis data
analysis_data <- meta_data %>%
select(all_of(metadata_variables)) %>%
left_join(peak_data_log,"PARENT_SAMPLE_NAME")
The goal of this section is to design a table that resembles the study design of your experiment. This step allows us to see descriptive statistics of the samples in our data. For this, we are using the analysis data to define a table with a similar structure as the table below.
| Strat 1 | Strat 2 | |
|---|---|---|
| Var1 | XX | XX |
| Var 2 | XX | XX |
| Var 3 | XX | XX |
Alternatively, we can define a table non-stratified table with only the number of observations within each experimental group. The following code will create the desired table 1 for both scenarios.
In this chunk, we will label the analysis data variable names. Labeling the variable names will allow the tables to display your chosen variable names while maintaining the original variable name within the data. We do this in two steps.
Change the name assigned to var1 to match the name of the variable in the analysis data set which you want to include in the table.
Repeat this for as many variables as needed.
Choose a single variable to stratify the table by.
Assign the variable a label name
# 1. Choose the meta data variable name.
var1 = "GROUP_NAME"
var2 = "TIME1"
# 2. Choose a single variable to stratify the table by.
stratified_var = "Gender"
# 3. Assign the variable a label name
label(analysis_data[,var1]) <- "Treatment Group"
label(analysis_data[,var2]) <- "Time"
# 4. Assign a stratified variable a label name for the table.
label(analysis_data[,stratified_var]) = "Gender"
Now that you have properly assigned the variables, we will create table 1. To create the table, we will be using the table1 function. You must change the arguments depending on what you want the table to display. For example: * Table 1 without interaction:
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}}, \ \text{data = analysis_data} ) \]
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}} \underbrace{| \ stratified\_var}_{\text{Statified Variable}}, \ \text{data = analysis_data} ) \] Above is how you change the table1 function to display the table with and without a stratified variable. In the next chunk, we utilize the following steps.
Create table 1. In this step you will need to use the same variable names as in the above chunk within the table1 function.
Display the table
Save the table in the folder “/Outputs/Tables/” with the file name “table1”.
# 1. Creates table1
tbl1 <- table1(~ TIME1 + GROUP_NAME| Gender
, data = analysis_data)
# 2. Displays table 1
tbl1
| Female (N=44) |
Male (N=42) |
Overall (N=86) |
|
|---|---|---|---|
| Time | |||
| End | 14 (31.8%) | 15 (35.7%) | 29 (33.7%) |
| Onset | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| PreSymp | 15 (34.1%) | 13 (31.0%) | 28 (32.6%) |
| Treatment Group | |||
| 1902 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| Control | 14 (31.8%) | 14 (33.3%) | 28 (32.6%) |
| WT | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
# 3. saves table 1
t1flex(tbl1) %>%
save_as_docx(path = paste0("../Outputs/Tables/","table1.docx"))
Once we have the analysis data, and we are happy with the table 1 we have created we want to save it so we can use it for the downstream analysis. We will save this in the “Data/Processed” folder under the file name “analysis_data.csv”
# Save analysis data
write.csv(analysis_data,"../Data/Processed/analysis_data.csv", row.names = F)
In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods for data exploration:
A.) Principal component analysis
B.) Heat maps
Before diving into the data exploration, we need read the analysis data set. Note, we use check.names=F since column names of the analysis data are numeric.
# 1. Read analysis dataset
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component accounts for the largest possible amount of variance within the data. The second principal component accounts for the largest amount of remaining variance, and so on. Additionally, each of the principal components produced by PCA is uncorrelated with each of the other principal components. At a high level, PCA allows us to visualize sources of variation in the data.
Here, we will graph the first two principal components of our dataset. In the PCA plot, we can label each point based on variables from the metadata.
The following chunk runs PCA on the scale_peak_data matrix in the following steps:
1.Create a vector of non metabolite variables that are in the the analysis data set. These variables include information about the experimental conditions.
Create a data set for the principle component analysis which only includes information on the metabolites.
Run PCA of the pca_dat matrix containing only the metabolites.
Define the plot labels. This is a character string matching the name of one of the variables in the analysis_data which will be used to label points on the PCA graph.
Create and display the PCA plot.
Save the PCA plot in the “Outputs/Plots” folder with the file name PCA.pdf
# 1. Create the non-metabolite vector.
non_metabolite <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Create PCA data containing only metabolite data
pca_dat <- analysis_data[,!names(analysis_data) %in% non_metabolite]
# 3. Run PCA of the pca_dat matrix containing only the metabolites.
res.pca <- PCA(pca_dat,
graph = F)
# 4. Define labels
meta_var = "Gender"
# 5. Create figure
fviz_pca_ind(res.pca,
label = "none",
habillage = as.factor(analysis_data[,meta_var]))
# 6. Save figure
ggsave(paste0("../Outputs/Plots/","PCA.pdf"), width = 10, height = 10)
Suppose you notice a variable with clearly separated groups, and is not a variable of interest. In that case, you may want to consider stratifying your analysis downstream by the values of that variable. For example, if we are looking at the effects of a specific drug, and we notice in the above plot that the samples are grouped by sex, then we may want to consider stratifying the analysis by male and female.
Another exploratory analysis tool we can use is heatmaps, which is a gridded plot based on the x-axis- and y-axis labels. The color of the spot on the grid is based on the value’s intensity. For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log peak value for each chemical in each observation. We can order our observations to see intensity patterns based on our experimental conditions. This is another way of visualizing sources of variation within our data. To do this, we need to merge the chemical annotation, meta, and scaled peak data. Then we need to arrange the data based on our experimental conditions. The create_heatmap_Data function in the R folder will help us prepare the data for the heatmap. To do this the create_heatmap_Data function takes three arguments.
Analysis data (analysis_data)
Heatmap Variables (heatmap_variables) - defines the variables used to order the samples.
… -
The main utility of create_heatmap data is in (…), which allows you to arrange the columns of the heatmap data. If you are familiar with dplyr, the arrange function orders samples based on the arguments passed into (…). If you are unfamiliar with dplyr, an overview of the arrange function is here.
We will be building three heatmaps, each heatmap will be increasing in complexity. For your expirment you may not need to use all three of these heatmaps. These three heatmaps will be demonstrated with the mouse ALS data. For each heatmap we will look at the top 50 metabolites, which is determined by the highest mean value of the metabolites. By filtering to the top 50 metabolites we will improve our ability to see sources of variation.
In the heatmap below, we select the top 50 metabolites. Once we know the top 50 metabolites with the highest mean value, we can filter the peak_data_log data only to include those metabolites. Then we can run the create_heatmap function to create the heatmap data. In the create_heatmap_data function, we arrange the data by treatment group. To do this the following steps are completed:
Define the heatmap_variables to use in the heatmap
Select the top 50 metabolites from the analysis data based off of mean value.
Filter the analysis_data to only include the top 50 metabolites
Generate heatmap data using the create_heatmap_Data function. In this step we arrange the heatmap data by treatment group. You may need to change the (…) part of the function to reflect the variable names in your data.
Create a palette for the heatmap. By default, we are using a red/blue palette.
Create the vals and the labels that will define the values of the heatmap and the labels of the heatmap.
Display heatmap
Save heatmap in the “Outputs/Plots” folder under the filename “HeatmapTop50Metabolites”
# 1. Define meta analysis variables
heatmap_variables <- c("GROUP_NAME")
# 2. Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# 3. Filter to the top 50 metabolites
analysis_data_top50 <- analysis_data %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name)
# 4. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50,
heatmap_variables = heatmap_variables ,
GROUP_NAME) # Arranges data frame for heatmap (...)
# 5. Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# 6. Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# 7. Create heatmap and save.
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F)
# 8. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, show_colnames = F,border_color = NA,filename = paste0("../Outputs/Plots/","HeatmapTop50Metabolites.pdf"))
We can add more complexity to the heatmap by including the blocking variable in the annotation. In the heatmap below the columns of the heatmap are arranged by the additional variable. Here, we follow the following steps.
Define the heatmap_variables to use in the heatmap. For our example this will be GROUP_NAME and TIME1.
Select the top 50 metabolites from the analysis data based off of mean value.
Filter the analysis_data to only include the top 50 metabolites
Generate heatmap data using the create_heatmap_Data function. In this step we arrange the heatmap data by treatment group. You may need to change the (…) part of the function to reflect the variable names in your data. We will be using desc(TIME) to arrange the time of treatment in descending order.
Create a palette for the heatmap. By default, we are using a red/blue palette.
Create the vals and the labels that will define the values of the heatmap and the labels of the heatmap.
Display heatmap
Save heatmap in the “Outputs/Plots” folder under the filename “HeatmapTop50MetabolitesTreatmentandBlock”
# 1. Define meta analysis variables
heatmap_variables <- c("GROUP_NAME",
"TIME1")
# 2. Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# 3. Filter to the top 50 metabolites
analysis_data_top50 <- analysis_data %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name)
# 4. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50,
heatmap_variables = heatmap_variables ,
GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap (...)
# 5. Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# 6. Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# 7. Create heatmap and save.
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F)
# 8. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, show_colnames = F,border_color = NA,filename = paste0("../Outputs/Plots/","HeatmapTop50MetabolitesTreamentandBlock.pdf"))
We may want to stratify this heatmap even farther by adding an additional variable. When we add a third variable, the heatmap may become more challenging to interpret. To overcome this, we recomment stratifying the heatmap by the levels of the third variable. For example, if we want to stratify by gender, we will have two separate heatmaps, one for males and one for females.
Idenify the statified variable and get the unique levels of the stratified variables.
Define the heatmap_variables to use in the heatmap. For our example this will be GROUP_NAME and TIME1.
Select the top 50 metabolites from the analysis data based off of mean value.
Create a heatmap for each level of the stratified variable and store the heatmaps in a list.
Save heatmap plots in the “Outputs/Plots” folder under the filename “HeatmapTop50MetabolitesTreatmentandBlockStratifed”
# 1. Get stratified variable and stratified values
strat_var <- "Gender"
values <- unique(analysis_data[,strat_var])
heatmap_variables <- c("GROUP_NAME",
"TIME1")
# 2. Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# Create for loop
maps <- list()
for (i in 1:length(values)) {
# 3. Filter to the top 50 metabolites
analysis_data_top50_i <- analysis_data[analysis_data[,strat_var]==values[i],] %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name)
# 4. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50_i,
heatmap_variables = heatmap_variables ,
GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap (...)
# 5. Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# 6. Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# 7. Create heatmap and save.
maps[[values[i]]] = pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F,
main = values[i], silent = T)
}
# 8. Save heatmaps for all levels
grids <- grid.arrange(grobs= list(maps$Female[[4]],
maps$Male[[4]]))
ggsave(filename = "../Outputs/Plots/HeatmapTop50MetabolitesTreamentandBlockStratified.pdf", grids)
## Saving 12 x 12 in image
Before diving in, lets load the matrices that we will need for the Sub-pathway Analysis section. The two matrices we will need are:
Analysis data
Chemical annotation data (chem_data): Contains all of the information about the metabolites, including which sub-pathway and super-pathway they belong to.
# 1. Path to Metabolon .xlsx file.
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
# 2. Analysis Data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 3. Read chemical annotations
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each sub-pathway is within a super-pathway. There are several metabolites within each sub-pathways and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we will be utilizing a combined fisher test. The combined Fisher test combines the p-values for each metabolite within a sub-pathway to test the same hypothesis at the sub-pathway level.
To test our hypothesis at the sub-pathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.
1.) Interaction: \(log Peak = Treatment + block + Treatment*block\)
2.) Parallel: \(log Peak = Treatment + block\)
3.) Single: \(log Peak = Treatment\)
For the interaction model, we are focusing only on the interaction term “Treatment*block” for this to test if there is a significant interaction between our treatment and the block variable. The parallel model is testing if the block variable is explaining a significant amount of the metabolite variance, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.
After running these three models for each metabolite, we will test at the sub-pathway level by combining the p-values for each metabolite within the sub-pathway for each model. We compute a chi-squared statistic to test at the sub-pathway level. For each model, we compute the chi-squared statistic by:
\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabablites in the sub-pathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\).
Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,
Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.
Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.
Homoscedasticity: Equal variance between treatment groups.
In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumptions for each metabolite being tested within the sub-pathway. For example, when we test the interaction model at the sub-pathway level, the p-values for interaction test are independent for each metabolite within the sub-pathway.
Now that we have our analysis data, we test the metabolomic data at
the sub-pathway level utilizing the pathway_analysis
function defined in the R script. This function takes the following
arguments:
Ananlysis data (analysis_data): This is a a data frame created in the “Analysis data creation” module which combines the metadata and the peak data into one analysis data set.
Chemical Annotation data (chem_data) provided by Metabolon.
Treatment variable (treat_var): This is the name of the variable in the analysis data that is the main variable of interest.
Block Variable (block_var): Is the name of the blocking variable in the dataset. If the the experimental design does not include a blocking variable, then the value of block_var=NULL.
Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite. The warning message from the subpathway_analysis function will display the metabolite that produced the warning message.
To run this analysis, we utilize the following steps.
Run the subpathway_analysis function.
Save the pathway analysis results to the “Data/Results” folder with the name “NonStratified_Full_pathway_results”
# 1. Run the subpathway function
path_dat <- subpathway_analysis(analysis_data,
chem_data = chem_data,
block_var = "TIME1",
treat_var = "GROUP_NAME")
# 2. Save results
write.csv(path_dat,file = "../Data/Results/NonStratified_Full_subpathway_results.csv", row.names = F)
We may notice that we need to stratify our analysis if we believe the effects of the model are different between the levels of the stratified variable. For example, we may notice that sex is a dominant source of variation in the metabolite data, and wemay want to look at males and females separately. One way to do this is to subset the data prior to running the subpathway_analysis function.
In the following chunk you can specify a variable to stratify by. For each level of the stratified variable, we:
1.) Subset the analysis data based on the stratified variable.
2.) Run the subpathay_analysis function with the subsetted data.
3.) Save the stratified data in “Data/Results” as “analysis_data_{Val}” where {Val} is the value of the stratified variable. For example, if we stratified by sex then the data will be saved as “analysis_data_female for the females.
4.) Save the results in “Data/Results” and the file will be saved as “Statified_{Val}_subpathway_results.csv” where {Val} is the value of the variable we stratified by. For example, if we stratified by sex, then the results for females would be saved as “Stratified_female_subpathway_results.csv”
Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite. The warning message from the subpathway_analysis function will display the metabolite that produced the warning message.
# 1. Name of the variable to stratify by.
stratified_var = "Gender"
# 3. Find unique values of this variable
uni_vals <- unique(analysis_data[,stratified_var])
# For each value perform the four steps listed above.
for(i in uni_vals){
print(paste0("Running the ", i, " strata."))
# 1. Subset analysis data
strat_data = analysis_data[analysis_data[,stratified_var]==i,]
# 2. Run subpathway function on the subsetted data
strat_path_results <- subpathway_analysis(strat_data,
chem_data = chem_data,
block_var = "TIME1",
treat_var = "GROUP_NAME")
# 3 Save stratified data
write.csv(strat_data, paste0("../Data/Processed/analysis_data_",i,".csv"), row.names = F)
# 4 Save results
write.csv(strat_path_results, paste0("../Data/Results/Stratified_",i,"_subpathway_results.csv"), row.names = F)
}
## [1] "Running the Female strata."
## [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
## [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
## [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
## [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
## [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
## [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
## [1] "Running the Male strata."
## [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
## [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
## [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
## [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
## [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
## [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
## [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
## [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
## [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
Once we have the Overall Analysis results, we will have tested all metabolites with an interaction, parallel and Single model. We then obtain a p-values through the combined fisher analysis for each sub-pathway. In the next few chunks, we will summarize the results with a few different tables. The first summary will show the number of significant sub-pathways for each model type. To do this we:
List the paths to the results files generated in the previous chunks.
Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.
Create a for loop that does the following:
Read in the results table from the path defined in step 1.
Structure the names of the levels for the model types in the table.
Create the table data used to count the number of significant sub-pathways for each model type.
Create the table by counting the number of sub-pathways for each model type.
Display the table
Save the table in “Outputs/Tables” folder under the file name “NumberOfSigPathwaysByModelType_{{strata}}”. You may want to change this name to reflect the strata you are summarizing.
Note: If you only tested the sub-pathways for the Single model then only the single model is displayed.
Note: The model type for each sub-pathway is determined hiearchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
# 1. Read in Results from the analysis step.
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
"../Data/Results/Stratified_Female_subpathway_results.csv")
# 2. Define the names of the strata
strata <- c("Male", "Female")
for (i in 1:length(strata)) {
# 3. Read in results data
path_data <- read.csv(results_path[i], check.names = F)
# 4. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
}
# 5. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 6. Create table
sig_subPaths <- count(table_data, model) %>%
arrange((model)) %>%
flextable() %>%
set_header_labels(model = "Model Type", n="Count")%>%
theme_vanilla() %>%
set_table_properties(layout = "autofit") %>%
set_caption(caption = paste0("Sigificant Pathways by Model (",
strata[i],")."))
# 7. Display table
cat(knitr::knit_print(sig_subPaths))
# 8. Save table
save_as_docx(sig_subPaths,path=paste0(
"../Outputs/Tables/NumberOfSigPathwaysByModelType_",strata[i],".docx"))
}
Model Type | Count |
|---|---|
Interaction | 84 |
Parallel | 12 |
Single | 2 |
None | 12 |
Model Type | Count |
|---|---|
Interaction | 6 |
Parallel | 20 |
Single | 19 |
None | 65 |
Additionally, we can summarize the results by looking at the number of significant sub-pathways within a super-pathway for each model type. This summary is done by:
List the paths to the results files generated in the previous chunks.
Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.
Create a for loop that does the following:
Read the results for the given strata.
Structure the levels for the model types based on the types of model ran.
Create the table data that has the combined fisher p-values for all models and the model type.
Formulating the superpath data by:
Filtering all subpathways that did not have a significant subpathway.
Join the results with the chemical annoation data to group by the superpathway.
Summarize the percent of subpathways within superpathways that are significant.
Display table
Save table in the “Outputs/Tables” folder under the name “SigSuperPathwayPecentages_{{strata}}”
Note: The model type for each sub-pathway is determined hierarchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
# 1. Read in Results from the analysis step.
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
"../Data/Results/Stratified_Female_subpathway_results.csv")
# 2. Define the names of the strata
strata <- c("Male", "Female")
for(i in 1:length(strata)){
# 3 Read results for the strata
path_data <- read.csv(results_path[i], check.names = F)
# 4. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 5. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 6. Formulate the Super-pathway results table.
superPath <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, pval) %>%
right_join(chem_data %>% select(SUPER_PATHWAY, SUB_PATHWAY) %>% distinct(),
by = c("sub_pathway" = "SUB_PATHWAY")) %>%
filter(!is.na(sub_pathway)) %>%
mutate(sig = ifelse(is.na(pval), 0, 1)) %>%
group_by(SUPER_PATHWAY) %>%
summarise(percent_significant = round(mean(sig) * 100, 2),
number_significant = sum(sig),
pathway_count = n()) %>%
ungroup() %>% arrange(-percent_significant) %>%
transmute(SUPER_PATHWAY, percent_significant = paste0(number_significant, " / ", pathway_count,
" (", percent_significant, "%)")) %>%
flextable() %>%
set_header_labels(SUPER_PATHWAY="Super Pathway", percent_significant = "Percent Significant") %>%
set_table_properties(layout = "autofit") %>%
set_caption(caption = paste0("Proportion of significant subpathways within super-pathways (",
strata[i],").")) %>%
theme_vanilla()
# 7. Display table
cat(knitr::knit_print(superPath))
# 8. Save table
save_as_docx(superPath,path = paste0("../Outputs/Tables/SigSuperPathwayPecentages_",strata[i],".docx"))
}
Super Pathway | Percent Significant |
|---|---|
Amino Acid | 16 / 16 (100%) |
Energy | 2 / 2 (100%) |
Nucleotide | 8 / 8 (100%) |
Partially Characterized Molecules | 1 / 1 (100%) |
Xenobiotics | 5 / 5 (100%) |
Carbohydrate | 7 / 8 (87.5%) |
Lipid | 46 / 53 (86.79%) |
Cofactors and Vitamins | 9 / 11 (81.82%) |
Peptide | 3 / 5 (60%) |
Super Pathway | Percent Significant |
|---|---|
Xenobiotics | 5 / 5 (100%) |
Amino Acid | 13 / 16 (81.25%) |
Peptide | 3 / 5 (60%) |
Nucleotide | 3 / 8 (37.5%) |
Lipid | 16 / 53 (30.19%) |
Carbohydrate | 2 / 8 (25%) |
Cofactors and Vitamins | 2 / 11 (18.18%) |
Energy | 0 / 2 (0%) |
Partially Characterized Molecules | 0 / 1 (0%) |
The next table we create will show all of the significant sub-pathways and their model type. We do this by.
List the paths to the results files generated in the previous chunks.
Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.
Create a for-loop that does the following:
Reads in the results from the path listed
Structure model type levels.
Create pathway data that contains the sub-pathway, model type and the p-value for the model type with the the pathways that were non-significant filtered out.
Format table
Display table
Save table in the “Outputs/Tables” folder with the name “SigSubpathwayTable_{{strata}}”
# 1. Read in Results from the analysis step.
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
"../Data/Results/Stratified_Female_subpathway_results.csv")
# 2. Define the names of the strata
strata <- c("Male", "Female")
# Create for loop
for (i in 1:length(strata)){
# 3 Read Path data
path_data <- read.csv(results_path[i])
# 4. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 5. Create pathway table
pathway_table <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct() %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval)
# 6. Format table
path_tab = pathway_table %>%
flextable() %>%
set_header_labels(sub_pathway="Subpathway", model = "Model Type", pval = "P-value") %>%
set_table_properties(layout = "autofit") %>%
colformat_double(digits = 3) %>%
theme_vanilla() %>%
set_caption(caption = paste0("Significant Subpathways (",
strata[i],")."))
# 7. Display table
cat(knitr::knit_print(path_tab))
# Save table
save_as_docx(path = paste0("../Outputs/Tables/SigSubpathwayTable_",strata[i],".docx"))
}
Subpathway | Model Type | P-value |
|---|---|---|
Glutamate Metabolism | Interaction | 0.000 |
Polyamine Metabolism | Interaction | 0.000 |
Nicotinate and Nicotinamide Metabolism | Interaction | 0.000 |
Fatty Acid, Dihydroxy | Interaction | 0.000 |
Tryptophan Metabolism | Interaction | 0.000 |
Lysine Metabolism | Interaction | 0.000 |
TCA Cycle | Interaction | 0.000 |
Purine Metabolism, Guanine containing | Interaction | 0.000 |
Leucine, Isoleucine and Valine Metabolism | Interaction | 0.000 |
Glycolysis, Gluconeogenesis, and Pyruvate Metabolism | Interaction | 0.000 |
Primary Bile Acid Metabolism | Interaction | 0.000 |
Purine Metabolism, (Hypo)Xanthine/Inosine containing | Interaction | 0.000 |
Long Chain Polyunsaturated Fatty Acid (n3 and n6) | Interaction | 0.000 |
Medium Chain Fatty Acid | Interaction | 0.000 |
Methionine, Cysteine, SAM and Taurine Metabolism | Interaction | 0.000 |
Chemical | Interaction | 0.000 |
Purine Metabolism, Adenine containing | Interaction | 0.000 |
Urea cycle; Arginine and Proline Metabolism | Interaction | 0.000 |
Alanine and Aspartate Metabolism | Interaction | 0.000 |
Sterol | Interaction | 0.000 |
Phospholipid Metabolism | Interaction | 0.000 |
Creatine Metabolism | Interaction | 0.000 |
Secondary Bile Acid Metabolism | Interaction | 0.000 |
Sphingolipid Synthesis | Interaction | 0.000 |
Gamma-glutamyl Amino Acid | Interaction | 0.000 |
Food Component/Plant | Interaction | 0.000 |
Fatty Acid, Dicarboxylate | Interaction | 0.000 |
Long Chain Saturated Fatty Acid | Interaction | 0.000 |
Pyrimidine Metabolism, Orotate containing | Interaction | 0.000 |
Long Chain Monounsaturated Fatty Acid | Interaction | 0.000 |
Vitamin A Metabolism | Interaction | 0.000 |
Pyrimidine Metabolism, Uracil containing | Interaction | 0.000 |
Pentose Metabolism | Interaction | 0.000 |
Pyrimidine Metabolism, Cytidine containing | Interaction | 0.000 |
Tocopherol Metabolism | Interaction | 0.000 |
Endocannabinoid | Interaction | 0.000 |
Aminosugar Metabolism | Interaction | 0.000 |
Fatty Acid, Monohydroxy | Interaction | 0.000 |
Glycerolipid Metabolism | Interaction | 0.000 |
Ceramides | Interaction | 0.000 |
Phosphatidylethanolamine (PE) | Interaction | 0.000 |
Phosphatidylinositol (PI) | Interaction | 0.000 |
Phosphatidylcholine (PC) | Interaction | 0.000 |
Sphingomyelins | Interaction | 0.000 |
Dipeptide | Interaction | 0.000 |
Monoacylglycerol | Interaction | 0.000 |
Lysoplasmalogen | Interaction | 0.000 |
Lysophospholipid | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Long Chain Saturated) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Medium Chain) | Interaction | 0.000 |
Ascorbate and Aldarate Metabolism | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Glycine) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated) | Interaction | 0.000 |
Partially Characterized Molecules | Interaction | 0.000 |
Hexosylceramides (HCER) | Interaction | 0.000 |
Fatty Acid, Branched | Interaction | 0.000 |
Diacylglycerol | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Polyunsaturated) | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Hydroxy) | Interaction | 0.000 |
Fatty Acid, Amino | Interaction | 0.000 |
Fatty Acid Metabolism (Acyl Carnitine, Dicarboxylate) | Interaction | 0.000 |
Plasmalogen | Interaction | 0.000 |
Dihydrosphingomyelins | Interaction | 0.000 |
Lactoyl Amino Acid | Interaction | 0.000 |
Unknown Metabolite | Interaction | 0.000 |
Pentose Phosphate Pathway | Interaction | 0.000 |
Glycosyl PE | Interaction | 0.002 |
Tyrosine Metabolism | Interaction | 0.010 |
Histidine Metabolism | Interaction | 0.010 |
Hemoglobin and Porphyrin Metabolism | Interaction | 0.010 |
Sphingosines | Interaction | 0.010 |
Glycine, Serine and Threonine Metabolism | Interaction | 0.010 |
Glutathione Metabolism | Interaction | 0.010 |
Vitamin B6 Metabolism | Interaction | 0.010 |
Thiamine Metabolism | Interaction | 0.010 |
Fructose, Mannose and Galactose Metabolism | Interaction | 0.010 |
Phenylalanine Metabolism | Interaction | 0.020 |
Fatty Acid Metabolism (also BCAA Metabolism) | Interaction | 0.020 |
Disaccharides and Oligosaccharides | Interaction | 0.033 |
Ketone Bodies | Interaction | 0.034 |
Mevalonate Metabolism | Interaction | 0.040 |
Riboflavin Metabolism | Interaction | 0.040 |
Tetrahydrobiopterin Metabolism | Interaction | 0.040 |
Guanidino and Acetamido Metabolism | Interaction | 0.040 |
Drug - Topical Agents | Parallel | 0.000 |
Pyrimidine Metabolism, Thymine containing | Parallel | 0.000 |
Dihydroceramides | Parallel | 0.000 |
Eicosanoid | Parallel | 0.000 |
Benzoate Metabolism | Parallel | 0.000 |
Phosphatidylserine (PS) | Parallel | 0.000 |
Dipeptide Derivative | Parallel | 0.000 |
Docosanoid | Parallel | 0.001 |
Glycogen Metabolism | Parallel | 0.002 |
Purine and Pyrimidine Metabolism | Parallel | 0.008 |
Inositol Metabolism | Parallel | 0.010 |
Drug - Other | Parallel | 0.040 |
Phosphatidylglycerol (PG) | Single | 0.001 |
Oxidative Phosphorylation | Single | 0.026 |
Subpathway | Model Type | P-value |
|---|---|---|
Purine Metabolism, (Hypo)Xanthine/Inosine containing | Interaction | 0.000 |
Pentose Phosphate Pathway | Interaction | 0.004 |
Glycogen Metabolism | Interaction | 0.011 |
Food Component/Plant | Interaction | 0.020 |
Lactoyl Amino Acid | Interaction | 0.040 |
Drug - Other | Interaction | 0.040 |
Nicotinate and Nicotinamide Metabolism | Parallel | 0.000 |
Primary Bile Acid Metabolism | Parallel | 0.000 |
Chemical | Parallel | 0.000 |
Histidine Metabolism | Parallel | 0.000 |
Hemoglobin and Porphyrin Metabolism | Parallel | 0.000 |
Secondary Bile Acid Metabolism | Parallel | 0.000 |
Sphingolipid Synthesis | Parallel | 0.000 |
Eicosanoid | Parallel | 0.000 |
Benzoate Metabolism | Parallel | 0.000 |
Acetylated Peptides | Parallel | 0.000 |
Phosphatidylserine (PS) | Parallel | 0.000 |
Hexosylceramides (HCER) | Parallel | 0.000 |
Diacylglycerol | Parallel | 0.000 |
Unknown Metabolite | Parallel | 0.000 |
Drug - Topical Agents | Parallel | 0.010 |
Polyamine Metabolism | Parallel | 0.020 |
Phospholipid Metabolism | Parallel | 0.030 |
Glutamate Metabolism | Parallel | 0.040 |
Urea cycle; Arginine and Proline Metabolism | Parallel | 0.040 |
Glycine, Serine and Threonine Metabolism | Parallel | 0.040 |
Tryptophan Metabolism | Single | 0.000 |
Lysine Metabolism | Single | 0.000 |
Leucine, Isoleucine and Valine Metabolism | Single | 0.000 |
Phenylalanine Metabolism | Single | 0.000 |
Methionine, Cysteine, SAM and Taurine Metabolism | Single | 0.000 |
Purine Metabolism, Adenine containing | Single | 0.000 |
Tyrosine Metabolism | Single | 0.000 |
Fatty Acid, Dicarboxylate | Single | 0.000 |
Fatty Acid Metabolism (also BCAA Metabolism) | Single | 0.000 |
Dipeptide | Single | 0.000 |
Lysoplasmalogen | Single | 0.000 |
Plasmalogen | Single | 0.000 |
Inositol Metabolism | Single | 0.010 |
Pyrimidine Metabolism, Uracil containing | Single | 0.010 |
Docosanoid | Single | 0.015 |
Guanidino and Acetamido Metabolism | Single | 0.020 |
Lysophospholipid | Single | 0.020 |
Dipeptide Derivative | Single | 0.030 |
Fatty Acid Metabolism (Acyl Glycine) | Single | 0.040 |
Now we will looks at the metabolites within significant sub-pathways, and the p-values for the significant model. In this section we will look at all of the metabolites within the top two sub-pathways based on p-values. To do this we will:
List the paths to the results files generated from the sub-pathway analysis.
Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.
Name the model type that we are interested in looking at. The options are “Interaction”, “Parallel” or “Single”. This is case sensitive and must match the options.
Create a for-loop that does the following:
Read in the sub-pathways analysis results for the given strata.
Structure the levels for the model type in the results.
Create the data used for the tables which includes sub-pathway, the fisher p-values, and the model type for the sub-pathway.
Create the top sub-pathways data which will group by model type and order by the sub-pathways based on p-value. In this code we are only going to focus on the top two sub-pathways. If you want to focus on more sub-pathways you can edit the “slice” function.
Create a list of the top two sub-pathways for the specified model type.
Create a second for loop which:
Create the table which shows all the metabolite p-values for the model type.
Diplay the table
Save the table in “Outputs/Tables” with the file name “Metabolite_model_pvalues_{{pathway}}_{{strata}}”
# 1. Read in Results from the analysis step.
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
"../Data/Results/Stratified_Female_subpathway_results.csv")
# 2. Define the names of the strata
strata <- c("Male", "Female")
# 3 Model type (Interaction, Parallel, Single)
mod = "Interaction"
# Create for loop
for(i in 1:length(strata)){
# 4. Read in table data and results from overall analysis.
path_data <- read.csv(results_path[i], check.names = F)
# 5. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 6. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 7. Create top pathways table
top_pathway_table <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval) %>%
filter(model == mod) %>%
group_by(model) %>% slice(1:2) %>% ungroup() %>%
left_join(select(path_data, sub_pathway, chem_name, ends_with("_pval")),
by = "sub_pathway") %>%
distinct()
# 8. list the top two subpathways.
pathways <- unique(top_pathway_table$sub_pathway)
# Create pathways for look
for(j in 1:length(pathways)){
# 9. Create table for specific model type
top_paths_table <- top_pathway_table %>%
filter(sub_pathway == pathways[j]) %>%
select(chem_name, all_of(paste0(tolower(mod),"_pval"))) %>%
arrange(all_of(paste0(tolower(mod),"p_val"))) %>%
flextable() %>%
set_header_labels(chem_name="Metabolite") %>%
set_table_properties(layout = "autofit") %>%
add_header_row(values = c(pathways[j]),colwidths = 2) %>%
theme_vanilla() %>%
align(i=1,align = "center", part = "header") %>%
colformat_double(digits = 3) %>%
set_caption(caption = paste0("Metabolites Within Pathways of Significant ",mod, " Model(",
strata[i],")."))
# 10. Display table
cat(knitr::knit_print(top_paths_table))
# 11. Save table
save_as_docx(top_pathway_table,
path = paste0("../Outputs/Tables/Metabolite_model_pvalues_",gsub("[^A-Za-z0-9 ]","",pathways[j]),"_",strata[i],".docx"))
}
}
Glutamate Metabolism | |
|---|---|
Metabolite | interaction_pval |
S-1-pyrroline-5-carboxylate | 0.356 |
glutamate | 0.233 |
glutamine | 0.023 |
N-acetylglutamate | 0.091 |
N-acetylglutamine | 0.091 |
N-acetyl-aspartyl-glutamate (NAAG) | 0.103 |
alpha-ketoglutaramate* | 0.075 |
4-hydroxyglutamate | 0.248 |
N-methyl-GABA | 0.076 |
carboxyethyl-GABA | 0.061 |
Polyamine Metabolism | |
|---|---|
Metabolite | interaction_pval |
putrescine | 0.181 |
spermidine | 0.061 |
N-acetylputrescine | 0.042 |
5-methylthioadenosine (MTA) | 0.131 |
spermine | 0.091 |
4-acetamidobutanoate | 0.282 |
(N(1) + N(8))-acetylspermidine | 0.034 |
Purine Metabolism, (Hypo)Xanthine/Inosine containing | |
|---|---|
Metabolite | interaction_pval |
hypoxanthine | 0.006 |
allantoic acid | 0.250 |
inosine | 0.001 |
inosine 5'-monophosphate (IMP) | 0.859 |
allantoin | 0.316 |
xanthine | 0.021 |
urate | 0.667 |
xanthosine | 0.961 |
N1-methylinosine | 0.941 |
Pentose Phosphate Pathway | |
|---|---|
Metabolite | interaction_pval |
ribose 1-phosphate | 0.004 |
At the metabolite level, we can look at the pairwise comparisons for all experimental groups. In the following section we will look at the pairwise comparisons using the metabolite_pairwise function defined in the R scripts. We will first read in the analysis data needed for this analysis. Since the column names of the analysis data are numeric, we should convert all of the column names to to characters.
# 1. Read in data from the analysis above
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Make the variable names charactors
names(analysis_data) <- as.character(names(analysis_data))
Once we have the analysis data loaded in, we can now look at the pairwise comparisons for each of our experimental groups. To do this we will be using the metabolite_pairwise function defined in the R scripts. This function will analyze each metabolite individually. For each metabolite, the the metabolite_pairwise function will first test if any of the experimental groups explained a significant proportion of the variance in the metabolite using an F-test. Since we will be looking at multiple comparisons for the metabolite it is good practice to first look at the over-all p-value from the F-test before looking at the pairwise comparisons.
The metabolite_pairwise function then looks at all pairwise comparisons utilizing the emmeans package. The metabolite_pairwise function returns a data frame with the metabolite, overall p-value, estimated difference in means for all experimental groups, and the p-value which test if the difference the groups is significantly different.
The metabolite_pairwise function has three arguments:
form: This is a character string the resembles the right hand side of a simple linear regression model in R. For example form = “Group1 + Group2”.
data: The analysis data we will use for the pairwise comparisons.
Metabolites: A list of metabolites that we want to be analyzed.
In the next chunk we will stratify the analysis by running the metabolite_pairwise function for each stata in our analysis data set. To do this we:
Define the variable in the analysis data that we want to stratify our analysis by.
Define the form variable for the metabolite_pairwise function.
Create a list of metabolites that we want to pairwise analysis for. By default, we are going to use all metabolites in the analysis data.
Create a for look that:
Filters the analysis_data to focus on a single strata.
Run the metabolite_pairwise function for that strata.
Save results in the “../Data/Results/Pairwise_Comparisons” folder with the file name “Pairwise_{{strata}}.
# 1. Define the variable in the analysis data that we want to stratify
strata_var <- "Gender"
# 2. Define the form variable
form <- "GROUP_NAME*TIME1"
# Create a list of metabolites
analysis_variables <- c("PARENT_SAMPLE_NAME", "GROUP_NAME",
"TIME1", "Gender")
metabolites <- names(analysis_data)[!names(analysis_data) %in% analysis_variables ]
stratas <- unique(analysis_data[,strata_var])
for (i in 1:length(stratas)) {
# 4. Filter analysis data
dat <- analysis_data[analysis_data[,strata_var]==stratas[i],]
# 5. Run the metabolite_pairwise function for the strata
strat_results <- metabolite_pairwise(form = form,
data=dat,
metabolites = metabolites)
# 6. Save results
write.csv(strat_results, file = paste0("../Data/Results/Pairwise_Comparisons/Pairwise_",
stratas[i],".csv"), row.names = F)
}
## ================================================================================
## ================================================================================
For the metabolites which had a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis), we will produce a heatmap of the log fold changes. This heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater. To do this we will,
Read in the results from the pairwise comparisons. If you stratified the analysis, you may want to copy this chunk and rerun for each strata. Additionally, we will need to read in the chemical annotation file. When we read in the results from the pairwise comparisons, we will filter the results to only look at the metabolites with a significant overall p-value.
Merge the chemical annotation fill with the results from the pairwise comparisons.
Create the heatmap. In this step we are doing some pre-precessing which removes the p-value columns and then transforms the remaining fold change results from wide to long format. We then mutate the logFoldChanges to only display when the fold change is greater than 2 or less than .5. Then we use plotly to create an interactive heatmap.
Display heatmap
Save heatmap in “…/Outputs/Plots” with the file name “PaiwiseComp_EST_heatmap”. Since this plot is interactive, it will be save as an html file that can be opened in a web browser.
# 1. Read in pairwise results
results_pair <-read.csv("../Data/Results/Pairwise_Comparisons/Pairwise_Female.csv"
, check.names = F) %>%
filter(Overall_pval < 0.05)
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
data <- chem_data %>%
select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
arrange(SUB_PATHWAY)
# 3. Produce Heatmap
p = data %>%
select(CHEM_ID,SUB_PATHWAY,CHEMICAL_NAME, all_of(names(data)[grepl("EST",names(data))])) %>%
melt(id.vars = c("CHEM_ID","SUB_PATHWAY","CHEMICAL_NAME"), variable.name = "Contrast",
value.name = "logFoldChange") %>%
mutate(Contrast = gsub("_ESTS","",Contrast),
logFoldChange = ifelse(logFoldChange < log(0.5) | logFoldChange > log(2),
round(logFoldChange,3), NA)) %>%
plot_ly(
type = "heatmap",
x= ~Contrast,
y = ~CHEMICAL_NAME,
z = ~logFoldChange,
text = ~SUB_PATHWAY,
hovertemplate = paste("<b>Metabolite: %{y}</b><br><br>",
"Sub-pathway: %{text}<br>",
"Contrast: %{x}<br>",
"Log Fold-Change: %{z}<br>",
"<extra></extra>"),
colorbar = list(title ="<b>Log Fold-Change</b>")) %>%
layout(title = "<b>Log Fold-Change Heatmap</b>",
xaxis = list(title="<b>Contrasts</b>"),
yaxis = list(title = ""))
# 4. Display heatmap
p
# 5. Save.
#htmlwidgets::saveWidget(p,paste0("../Outputs/Plots/PaiwiseComp_EST_heatmap_",".html"))
For the metabolites which had a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis), we will now produce a heatmap based on the pairwise comparison p-values. To do this we will:
Read in the results from the pairwise comparisons. If you stratified the analysis, you may want to copy this chunk and rerun for each strata. Additionally, we will need to read in the chemical annotation file. When we read in the results from the pairwise comparisons, we will filter the results to only look at the metabolites with a significant overall p-value.
Merge the chemical annotation fill with the results from the pairwise comparisons.
Create the heatmap. In this step we are doing some pre-precessing which removes the log fold change columns and then transforms the remaining p-value results from wide to long format. Then we use plotly to create an interactive heatmap.
Display heatmap
Save heatmap in “…/Outputs/Plots” with the file name “PaiwiseComp_p_value_heatmap”. Since this plot is interactive, it will be save as an html file that can be opened in a web browser.
# 1. Read in pairwise results
results_pair <-read.csv("../Data/Results/Pairwise_Comparisons/Pairwise_Female.csv"
, check.names = F) %>%
filter(Overall_pval < 0.05)
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
data <- chem_data %>%
select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
arrange(SUB_PATHWAY)
# 3. Produce Heatmap
p = data %>%
select(CHEM_ID,SUB_PATHWAY,CHEMICAL_NAME, all_of(names(data)[grepl("PVALS",names(data))])) %>%
melt(id.vars = c("CHEM_ID","SUB_PATHWAY","CHEMICAL_NAME"), variable.name = "Contrast",
value.name = "P_value") %>%
mutate(Contrast = gsub("_PVALS","",Contrast),
P_value = round(P_value,3)) %>%
arrange(SUB_PATHWAY) %>%
plot_ly(
type = "heatmap",
x= ~Contrast,
y = ~CHEMICAL_NAME,
z = ~P_value,
text = ~SUB_PATHWAY,
hovertemplate = paste("<b>Metabolite: %{y}</b><br><br>",
"Sub-pathway: %{text}<br>",
"Contrast: %{x}<br>",
"P-Value: %{z}<br>",
"<extra></extra>"),
colorbar = list(title ="<b>P-value</b>")) %>%
layout(title = "<b>P-Value Heatmap</b>",
xaxis = list(title="<b>Contrasts</b>"),
yaxis = list(title = ""))
# 4. Display heatmap
p
# 5. Save.
#htmlwidgets::saveWidget(p,paste0("../Outputs/Plots/PaiwiseComp_p_value_heatmap_",".html"))
Visualizations of the data can help us see the underlying trends. Two useful visualization is boxplots and line graphs. Metabolon provides customers with similar box plots, but the following will allow to look at the boxplots specifically with a sub-pathway.
Suppose there is a specific subpathway that we would like to visualize. In that case, we can compare the treatment groups for each metabolite within that sub-pathway using the subpathway_box plots function defined in the R script. T his function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the sub pathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the sub-pathway of interest.
X: This the the name of the variable in the meta data that is used for the X axis of the box plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
… : At the end of this function you can provide additional arguments for filtering the analysis data. In the examples below we are filtering so the box plots only focus on males and only two treatment groups.
The output of the subpathway_boxplot function is box plots for each of the metabolites within the specified subpathway. The labels for the box plots are default based on the data created inside the subpathway_boxplots function. Since these box plots utilize ggplot, we can customize all labels using the. labs function from ggplot.
Here we look at the boxplots for each metabolite within a specified subpathway utilizing the subpathway_boxplots function. The subpathway_boxplots function takes the following arguments
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Reach in chemical annotation data
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 3. Specify subpathway of interest.
subpath = "Gamma-glutamyl Amino Acid"
# 4. Run the subpathway_boxplots function
box_1 <- subpathway_boxplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpath,
X = TIME1,
groupBy = GROUP_NAME,
# Additional analysis data filtering options.
Gender == "Male", GROUP_NAME != "Control")
# 5. Customize labels for the boxplots
box_1 +
labs(x = "Time", y="Scaled Intensity", color="Treatment Group",
title = "Metabolite boxplots for the Gamma-glutamyl Amino Acid Subpathway")
# 6. Save boxplots. You can change the file type for example change pdf to png.
ggsave(filename = paste0("../Outputs/Plots/Metabolite_boxplots_",subpath,".pdf"))
## Saving 10 x 10 in image
Another useful visualization is line plots. These plots are useful for seeing trends between an ordered variable and the scaled intensity. Line plots can be useful for analyzing trends across the different treatment groups. Like the box plots, we are interested in seeing trends for all metabolites within a sub-pathway. To do this, we can utilize the “subpathway_lineplots” function defined in the R scripts. This function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the subpathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the subpathway of interest.
X: This the name of the variable in the metadata that is used for the X axis of the line plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
Below is a demonstration of utilizing the line plots to see trends between treatment groups at the metabolite level. In the chunk below we:
Read in the analysis data
Read in the chemical annotation data
Prep data for line plots. To produce the lines we must provide a numeric variable for the x-axis. If you are looking at trends across an ordinal categorical variable you will need to convert the variable to be numeric. For example, if we want to see the trend in time between treatment groups, where time takes the ordinal values of “Pre-symptomatic”, “onset of symptoms”, and “end of symptoms”, then we will need to assign these names to the values of 1,2, and 3 respectively. Additionally, we will need to make any other desired adjustments to our data at this step.
Define the sub-pathway of interest.
Run the “subpathway_lineplots function”
Edit the aesthetics of plots. Similar to the box plots, we are using the labs function to edit the labels for the plots. Additionally, we are utilizing the scale_x_continuous function at add labels to the tick marks for the plot that correspond with the original values.
Save the plots in the folder “Outputs/Plots” under the file name “linePlots_{{subpathway}} where {{subpathway}} is the name of the subpathway defined in step 4.
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Reach in chemical annotation data
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 3. Prep data for line plots
analysis_data <- analysis_data %>%
filter(Gender=="Male") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 4. Define subpathway of interest.
subpathway = "Gamma-glutamyl Amino Acid"
# 5. Run subpathway_line plots function.
line_plots <- subpathway_lineplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)
# 6. Edit asthetics of plots
line_plots +
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
## `geom_smooth()` using formula = 'y ~ x'
# 7. Save plots
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,".pdf"))
## Saving 12 x 12 in image
## `geom_smooth()` using formula = 'y ~ x'
We may also want to stratify the line plots by a specific condition. In the example below we stratify the line plots by gender in 10 steps.
Read in the analysis data
Read in the chemical annotation data
Define the subpathway of interest.
Prep data for line plots and filter on specific conditions in the data. In this example we are filtering the data on males. Additionally, the same processing steps from above need to be completed.
Filter analysis data on second condition following the same instructions from step 3. In this example we are filtering the data on females.
Run the subpathway_lineplots function for males
Run the subpathway_lineplots function for females
Use the ggarrange function to combine these two plots into one.
Save the plot in Outputs/Plots folder with the LinePlots_{{subpathway}}_stratified.
# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)
# 2. Reach in chemical annotation data
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 3. Define subpathway of interest.
subpathway = "Gamma-glutamyl Amino Acid"
# 4. Prep data for line plots
analysis_data_male <- analysis_data %>%
filter(Gender=="Male") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 5. Prep data for line plots
analysis_data_female <- analysis_data %>%
filter(Gender=="Female") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))
# 6. Create plot for males
males <- subpathway_lineplots(analysis_data = analysis_data_male,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)+
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
# 7. Create plot for females
females <- subpathway_lineplots(analysis_data = analysis_data_male,
chem_data = chem_data,
subpathway = subpathway,
X = TIME1,
groupBy = GROUP_NAME)+
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
# 8. Display stratified plot
ggarrange(males,
females,
labels = c("Males", "Females"),
nrow = 2,
common.legend = T,
vjust = 0
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# 9. Save plots
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,"_stratified.pdf"))
## Saving 12 x 12 in image